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Abstract 

We present the parallel and interacting stochastic approximation annealing (PISAA) algorithm, a 
stochastic simulation procedure for global optimisation, that extends and improves the stochastic ap¬ 
proximation annealing (SAA) by using population Monte Carlo ideas. The standard SAA algorithm 
guarantees convergence to the global minimum when a square-root cooling schedule is used; however the 
efficiency of its performance depends crucially on its self-adjusting mechanism. Because its mechanism is 
based on information obtained from only a single chain, SAA may present slow convergence in complex 
optimisation problems. The proposed algorithm involves simulating a population of SAA chains that in¬ 
teract each other in a manner that ensures significant improvement of the self-adjusting mechanism and 
better exploration of the sampling space. Central to the proposed algorithm are the ideas of (i) recycling 
information from the whole population of Markov chains to design a more accurate/stable self-adjusting 
mechanism and (ii) incorporating more advanced proposals, such as crossover operations, for the explora¬ 
tion of the sampling space. PISAA presents a significantly improved performance in terms of convergence. 
PISAA can be implemented in parallel computing environments if available. We demonstrate the good 
performance of the proposed algorithm on challenging applications including Bayesian network learning 
and protein folding. Our numerical comparisons suggest that PISAA outperforms the simulated anneal¬ 
ing, stochastic approximation annealing, and annealing evolutionary stochastic approximation Monte 
Carlo especially in high dimensional or rugged scenarios. 

Keywords: Stochastic approximation Monte Carlo, simulated annealing, population Markov chain Monte 
Carlo, local trap, stochastic optimisation 
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1 Introduction 


There is a continuous need for development of efficient algorithms to tackle mathematical optimisation 
problems often met in several fields of science. For instance, in computational chemistry, predicting the 
native conformation of a protein can be performed by minimising its potential energy. In classical or Bayesian 
statistics, inference can be performed by maximising the likelihood function (a statistical model assumed 
to have generated an observed data set) (Casella and Berger 1990[| or the associated posterior distribution 


density (a distribution that reflects the researcher’s belief in the unknown quantities of interest) (Robert 


20071, correspondingly. 


We assume that there is interest in minimising a function U{x), called cost function, defined on a space 
X d i.e. we seek (a;*, t7(x,|=)) such that = argminy^eAr C7(x). Hereafter, we will discuss in terms 
of minimisation because maximisation of U{x) can be performed equivalently by minimising the function 
lJ{x) := —U{x). Several stochastic optimisation algorithms have been proposed in the literature, e.g. simu¬ 
lated annealing (SA) (Kirkpatrick et al. 1983 Metropolis et al.[ 1953[|, genetic algorithm (Goldberg 1989 


Holland 1975[|, annealing stochastic approximation Monte Carlo (ASAMC ) (Liang 20071, annealing evolu¬ 


tionary stochastic approximation Monte Carlo (AESAMC) (Liang, 20111, stochastic approximation annealing 


(SAA) (Liang et al. 20141. Albeit their success, they encounter various difficulties in converging to the global 
minimum, an issue that becomes more severe when [/(•) is highly rugged or high dimensional. 


Simulated annealing (SA) (Kirkpatrick et al. 


1983 


Cerny 19851 aims at finding the global minimum 


based on the fact that minimisation of U{x) can be addressed in statistical terms by simulating the Boltzmann 
distribution /t,^(x), with density fT^{x)cc exp{— ^U{x)), at a small value of temperature parameter > 0 
close to 0. SA considers a temperature ladder {tj} that is a monotonically decreasing sequence of temperat¬ 
ures with Ti reasonably large. A standard version of SA involves simulating consecutively from a sequence 
of Boltzmann distributions {frf(x);t = 0,1,...}, parametrised by the temperature ladder, via Metropolis- 
Hastings MCMC updates ( Hastings} 1970, Metropolis et al., 19531. A standard version of SA is presented in 
Algorithm[T]as a pseudo-code. At early iterations, the algorithm aims at escaping from the attraction of local 
minima by flattening fnix) through tj. During the subsequent iterations, Tt decreases progressively towards 
0, and hence the values simulated from fr^ (x) concentrate in a narrower and narrower neighbourhood of 
the global mode of fnix) (or equiv. the global minimum of U{x)). In theory, convergence of SA to the 
global minimum can be ensured with probability 1 if a logarithmic cooling schedule 0(l/log(t)) is adopted 
(Geman and Geman 1984, Haario and Saksman, |1991 1, however this rate is too slow to be implemented in 
practice because it requires an extremely long GPU time. In practice, linear or geometric cooling schedules 
are used, however they do not guarantee convergence to the global minimum, and hence the algorithm tends 
to become trapped to local minima in complex scenarios. 


The stochastic approximation annealing (SAA) (Liang et al., 20141 is a stochastic optimisation algorithm 
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Algorithm 1 Simulated annealing algorithm used to detect the minimum of a cost function U{x), x e X 
Requires : Seed Xg e A, temperature ladder {xt}, density frt{x)ccexp{—^U{x)). 

Initialise : At < = 0, set Xg G X, and tq > 0. 

Iterate : For t = 1,T, 

For rit iterations repeat simulating /tj(-) by using a Metropolis-Hastings algorithm: 

1. Propose x' ~ (3(d • |x), where Q(d • j-) is a proposal distribution that can be sampled directly. 

2. Accept x' as Xt with prob. omh = min(l, Q(x'Pt^i) )- 


that builds upon the SA and SAMC[^ ideas. It involves simulating a time-inhomogeneons Markov chain via 
MCMC transitions targeting a seqnence of modified Boltzmann distributions whose densities adaptively 
adjnst via a stochastic approximation mechanism (inherited by SAMC). Each distribution of the seqnence is 
biased according to a partitioning scheme (inherited by SAMC) and parametrised by a temperatnre ladder 
(inherited by SA). SAA aims at gradnally forcing sampling toward the local minima of each snbregion of the 
partition through lowering the temperatnre with iterations, while it ensnres that each snbregion is visited 
by the chain according to a predetermined frequency. This strategy shrinks the sampling space in a soft 
manner and enables SAA to escape from local traps. The global minimum is guaranteed to be reached as 


the temperatnre tends to 0 if the temperature ladder uses a square root cooling schedule 0{l/^/i) (Liang 


et al., 20141. We emphasise that, compared to SA, SAA ensnres convergence to global minimnm at a much 


faster cooling schedule (square-root). In spite of these appealing featnres, the performance of SAA crucially 
depends on the efficiency of the self-adjnsting mechanism and the exploration of the sampling space involved. 
In scenarios that the cost function is rngged or high-dimensional, the exploration of the sampling space can 
be slow becanse it is performed by a single Markov chain. Moreover, the information obtained to snpport 
the self-adjnsting process is limited which makes the adjustment of the target density qnite unstable and too 
slow to convergence. When the target distribntion is poorly adjnsted, the convergence of the whole algorithm 
to the global minimum decays severely, and the chain may be trapped in local minima. This problematic 
behaviour can downgrade severely the overall performance of SAA, or even canse local trapping, in complex 
optimisation problems. 


^SAMC jLiang et al.[[2007[|20ld||Wu and Liang||201l[|Bomn et al.||2013[|Song et al.[|2014^ is an adaptive MCMC sampler 
that aims at addressing the local mode trapping problem that standard MCMC samplers encounter. It is a generalisation of 
the Wang-Landau algorithm | |Wang and Landau| |2001| | but equipped with a stochastic approximation scheme ( [Robbins and| 
[Monroj |1951| | that adjusts the target distribution. It involves generating a time-inhomogeneous Markov chain that targets a 
biased distribution, adjusted as the iterations evolve, instead of the distribution of interest itself. The biased distribution is 
parametrised by a partition scheme and designed such that the generated chain equally visits each subregion of the partition 


with a predetermined frequency as the iterations evolve. For an overview see I Liang 20141. 
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In this article, we develop the parallel and interacting stochastic approximation annealing (PISAA), a 


general purpose stochastic optimisation algorithm, that extends SAA (Liang et al. 20141 by using population 
Monte Carlo ideas (Song et al. 2014, Bornn et al. 2013 Liang and Wong 2000, 2001 Wu and Liang 20111. 
Essentially, PISAA works on a population of SAA chains that interact each other in a manner that eliminates 
the aforementioned problematic behaviour of SAA, and accelerates the overall convergence. This allows the 
proposed algorithm to demonstrate great performance, and address challenging optimisation problems with 
high-dimensional and very rugged cost functions. PISAA is enabled to use advanced MCMC transitions that 
incorporate crossover operations. These operations allow the distributed information across chains of the 
population to be used in guiding further simulations, and therefore lead to a more efficient exploration of the 
sampling space. Furthermore, PISAA is equipped with a more accurate and stable self-adjusting mechanism 
for the target density, that uses information gained from the whole population, and therefore accelerates the 
overall convergence of the algorithm to the global minimum. The use of multiple chains allows PISAA to 
initialise from various locations and search for the global minimum at different regions of the sampling space 
simultaneously. PISAA can be implemented in parallel, if parallel computing environment is available, and 
hence the computational overhead due to the generation of multiple chains can be reduced dramatically. It is 
worth emphasising that PISAA is not just an implementation of the SAA running in parallel; its key feature 
is the way the parallel chains interact in order to overcome the aforesaid problematic behaviour and improve 
performance. Our numerical examples suggest that the performance of PISAA improves with the size of the 
population. Also, in problems where the cost function is rugged or high-dimensional, PISAA significantly 
outperforms other competitors, SA, ASAMC, and SAA, and their population analogues, VESA, AESAMC, 
as it was able to discover the global minimum much quicker. 

The layout of the article is as follows. In Section we give a brief review of SAA and discuss problems 
concerning the efficiency of the algorithm; in Sectionj^ we present the proposed algorithm PISAA; in Section 
we examine the performance of the proposed algorithm and compare it with those of other stochastic op¬ 
timisation algorithms (such as SA, ASAMC, AESAMC, and SAA) against challenging optimisation problems; 
and in Section we conclude. 


2 Stochastic approximation annealing: A review 


Stochastic approximation annealing (SAA) algorithm (Liang et al. 20141 casts the optimisation problem 
in a combined framework of SAMC and SA, in the sense that the variant distribution is self-adjusted and 
parametrised by a sampling space partition and temperature ladder. 

Let £ = {Ej\ j = l,...,m} be a partition of the sampling space X with subregions Ei = {x e X ■. 
—00 < U{x) ^ Ml), ..., Ej = {x e X : Uj-i < U{x) ^ Uj), ..., E^ = {x e X : Um-i < U{x) < oo), and 
grid {uj] Uj e K, j = 1 : m — 1}, for to > 1. SAA aims at drawing samples from each subregion with a 
pre-specified frequency. Let tt := (tt^; j = 1,...,to), such that ttj = Pr(a; e Ej), wj > 0 and XijLi’’’i ~ 
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denote the vector of desired sampling frequencies of the m subregions {Ej}. We refer to as the desired 
probability. How to choose the partition scheme 8 for the sampling space or the desired probability } are 
problem dependent. SAA seeks to draw samples from the modified Boltzmann distribution with density 


1 1 

fe^,T^{x]£) = ^TTj—^exp(- U{x))l{x £ Ej)-, 

m I 

oc V exp(- U{x) - e Ej), 

3=1 


( 2 . 1 ) 


at a low temperature value where := = 1 : m), exp(— < qq are called 

bias weights, and 9^^ is such that exp(0^'^^)ocr(;^'^^/7rj, for j = 1, ...,m. 


The rational behind SAA is that, if were known, sampling from (2.11 could lead to a random walk 
in the space of subregions (by regarding each subregion as a point) with each subregion being sampled with 
frequency proportional to Ideally, this can ensure that the lowest energy subregion can be reached by 

SAA in a long enough run and thus samples can be drawn from the neighbourhood of the global minimum 
when is close to 0. 

Since are generally unknown, in order to simultaneously approximate these values and perform 

sampling, SAA is equipped with an adaptive MCMC scheme that combines SAMC and SA algorithms. Let 
{ 7 t; t = 1,...} denote the gain factor, in terms of SAMC algorithm, that is a deterministic, positive, and 
non-increasing sequence such as jt = to/t^ with /3 e (0.5,1]. Let {rt} denote a temperature ladder, in terms 
of SA algorithm, that is a deterministic, positive and non-increasing sequence such as Tt = ti/V^ + T:i: with 
ti > 0, and > 0 very small. We consider a sequence 9t '■= j = 1 : m), as a working estimator of 

(0,),}, where 6t e Q and 0 c K™ is a compact set, e.g. 0 = [10“^°, 10^°]"*. A truncation mechanism is also 
considered in order to ensure that {9t} remains in compact set 0. We define {Me, c = 1,...} as a positive, 
increasing sequence of truncation bounds for {9t}, and {c*} as the total number of truncations until iteration 
t. 


SAA algorithm proceeds as a recursion which consists of three steps, at iteration t: The sampling up¬ 
date, where a sample Xt is simulated from a Markov chain transition probabilities Pef_i^Ttixt,d-; 8) (e.g. a 
Metropolis-Hastings kernel) with invariant distribution /et_i,Tt(d-; 8); the weight update, where the unknown 
bias weights of the target density are approximated through a self-adjusting mechanism; and the truncation 
step, where {0*} is ensured to be in a compact set of 0. Given the notation above, SAA is presented as a 
pseudo-code in Algorithm 
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Algorithm 2 Stochastic approximation annealing algorithm 
Requires : Insert {r*}, { 7 *}, {Ej}, {Mk}, Oq 

Initialise : At t = 0, set xq e X, Oq e 0, such that l^ob < A/q, and cq = 0. 

Iterate : For t = 1, 

1. Sampling update: 

Simulate Xt from the Metropolis-Hastings transition probability £) that targets 

2. Weight update: 

Compute 0' = Ot-i + Xt), where Hr^iOtjXt) = [pt - tt], pt := {p[^\j = 1 : to), and 

= l(a:t e Ej) for j = 1, ...,to. 

3. Truncation step: 

Set 6t = 9', and ct = Ct-i if < Mc^, or set 9t = Oq, and c* = Ct_i + 1 if otherwise. 


Note that additive transformations of {9t} leave invariant. Therefore, it is possible to apply a 

0-normilisation step at the end of the run, such that <— + z, where exp(9n^ + z) = Z, and Z 

is a pre-specified constant, e.g. z = (—log(Xi^i exp(0l^^)); j = 1 : to) for Z = 1 . Appropriate conditions 
under which SAA is a valid adaptive MCMC algorithm that converges to the global minimum are reported 


in detail in (Conditions A1-A3 in Liang et al., 20141. 


SAA presents a number of appealing features when employed to minimise complex systems with rugged 
cost functions. SAA can work with an affordable square-root cooling schedule 0(1/Vt) for {rd, which 
guarantees the global minimum to be reached as the temperature tends to w 0, limt_>oo Tt = It is able 
to locate the minima of each subregion simultaneously (including the global minimum), after a long run, 
if is close to 0 (Corollary 3.1 in Liang et al., 2014[ |. It is worth mentioning that the square-root rate 
is much faster than the logarithmic rate that guarantees convergence in the SA algorithm. SAA gradually 
forces sampling toward the local minima of each subregion of the partition through lowering the temperature 
with iterations while it ensures that each subregion is visited by the chain according to the predetermined 
frequency this reduces the risk of getting trapped into local minima. 

The superiority of SAA is subject to its self-adjusting mechanism that operates based on the past samples 
in order to estimate the unknown {0*}. This remarkable mechanism, which distinguishes SAA from SA, 
proceeds as follows: Given that the current state of the Markov chain is at the subregion Ej and that a 
proposal has been made to jump to subregion Eji, if the proposal is rejected during the sampling update, the 
working value ^ will be adjusted to increase during the weight update and make it easier to be accepted in 
the next iteration; if otherwise, 6 *^ ^ will be adjusted to decrease during the weight update step and make it 
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harder to be accepted in the next iteration. Essentially, it penalises the over-visited subregions and rewards 
the under-visited subregions, and hence makes easier for the system to escape from local traps. This striking 
mechanism makes the algorithm appealing to address optimisation problems with rugged cost functions. 

Although SAA can be quite effective, its success depends crucially on whether the unknown bias weights 
{9t} can be estimated accurately enough through the adjustment process, and whether the Markov chain, 
generated through the sampling step, can explore the sampling space adequately. In complex problems where 
the ruggedness or the dimensionality of the cost function are high, the convergence of {Ot} is usually slow; 
an issue that significantly downgrades the overall performance of SAA. The reason is that, at each iteration, 
the self-adjusting process relies on limited information obtained based on a single draw from the sampling 
step. Essentially, the function Xt) is computed by only one single observation: at iteration t, pt 

in Algorithm is an m-dimensional vector of 0 & 1 (occurrence & absence) indicating to which subregion 
the sample Xt belongs. Even after a long run, this can cause a large variation on the estimate of {9t} and 
slow down severely the convergence of {Ot}, especially if the number of subregions m is large. Consequently, 
the adjustment of the target density becomes quite unstable and the self-adjusting mechanism becomes less 
effective. That can slow down the convergence of SAA, or even cause the chain to be trapped in local minima. 
This problematic behaviour can downgrade severely the ability of SAA to discover the global minumun in 
challenging optimisation problems. 

Because SAA presents appealing properties, it is of great importance to design an improved algorithm 
that inherits the aforementioned desired features and eliminates the aforementioned problematic behaviour 
of SAA. 


3 Parallel and interacting stochastic approximation annealing 

The parallel and interacting stochastic approximation annealing (PISAA) builds on the main principles of 


SAA (Liang et al. 20141 and the ideas of population MC ( Song et al.[ 2014, Bornn et al., 20131. It works on a 
population of parallel SAA chains that interact each other appropriately in order to facilitate the the search 
for the global minimum by improving the self-adjusting mechanism and the exploration of the sampling 
space. In what follows, we use the notation introduced in Section 


3.1 The procedure 

PISAA works with a population of samples at each iteration. At iteration t, let := i = 1 : k) 

denote the population of samples (abbr. population) which is defined on the population sample space 
A'' := X X ... X A. We refer to x^*^ as population individual and assume that xj*^ e A, for i = 1 ,..., k, 
where A e is called marginal sample space. The total number of population individuals k > 1 is called 
population size. 
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We assume that the whole population shares the same common partition scheme S = {Ej\ j = 1 : m) 
with subregions {Ej} defined according to a grid {uj; uj e R, j = 1 : m — 1}, as in Section For 
each individual, PIS A A aims at drawing samples from each subregion {Ej} with a desired probability tt := 
{nji j = 1, defined as in Section]^ Thus, under these specifications, we define a population modified 

Boltzmann distribution with density 




(3.1) 


i = l 
K m 


= n 7^1-777 exp(- — [/(xW))l(xW e Ej); 


i=ij=i 

Km w 

ocH Xi 


i=lj=l 


G Aj), 


where and are defined as in Section]^ Note that, the individuals of the population 

are independent and identically distributed (i.i.d.) such that each individual has marginal distribution 

"the SAA target distribution. Moreover, that the 
total number of the unknown weights {0^^} is invariant to the population size. The reason why we consider 
the individuals to be i.i.d. (share common £, { 0 ^'^^}) will become more clear later in the section. 

PISAA aims at simulating from the distribution /gX^(d-; £) at a low temperature > 0. The reason 


is similar to that of SAA: if {0*'^^} were known, sampling from (3.1 1 could lead to a random walk in the space 
of subregions with each subregion being sampled with frequency proportional to for each individual. 

Ideally, this can ensure that the lowest energy subregion can be reached, and thus samples can be drawn 
from the neighbourhood of the global minimum when is close to 0. Because {01'^^} are unknown, PISAA 


employs a population SAMC (Song et al. 2014 Bornn et ah, 20131 embedded with the SA in order to 
simultaneously approximate their values and sample the population. Therefore, we consider a sequence of 
population modified Boltzmann distributions rt(ds ^)} with density 


ft, III -| 

/i;^^_,^(d.; £)«n X exp(--C/(a:«) - 0P)l(cr« G E,), (3.2) 

i=ij=i * 

where the temperature sequence {rt}, gain factor { 7 *}, working estimates {9t} are defined as in Section]^ 
PISAA is a recursive procedure that iterates three steps: the sampling update, the weight update, and the 
truncation step. Although the structure of PISAA is similar to that of SAA, the sapling update and weight 
update are different and in fact significantly more efficient. 

The sampling update, at iteration t, involves simulating a population of k chains from a Markov transition 
probability Pg^\ T.j(’id-;£) that admits invariant distribution. The Markov transition 

probabilities {Pg‘^\ ^.^(’,d-;£")} can be designed as a mixture of different MCMC kernels. Because it uses 
a population of chains, PISAA allows the use of advanced updates for the design of these MCMC kernels 
which facilitate the exploration of the sampling space and the search for the global minimum. Two types of 
such operation updates are the mutation, and the crossover operations. 
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Mutation operations update the population individual-by-individual through Metropolis-Hastings within 
Gibbs algorithm (Mullerf 1991} Robert and Casella 20041 by viewing the population as a long vector. 


Because the population individuals in (3.2) are independent and identically distributed, in practice the 


whole population can be updated simultaneously (in parallel) by using the same operation with the 
same bias weights for each individual. This eliminates the computational overhead due to the genera¬ 
tion of multiple chains. Parallel chains allow breaking the sampling into parallel simulations, possibly 
initialised from different locations, which allows searching for global minimum at different subregions 
of the sampling space simultaneously. Moreover, it avoids the need to move a single chain across a 
potentially large and high modal sampling space. Therefore, it facilitates the search for the global 
minimum and the exploration of both the sample space and partition space, while it discourages local 


trapping. They include the random walk Metropolis (Metropolis et ah, 1953), hit-and-run (Smith 


1984 

Ghen and Schmeiser 

1996), /c-point ( 

Liang 

2011 Liang and Wong 2001 

2000), Gibbs (Muller 

1991 

Geman and Geman 

1984) updates etc. 


Crossover operations, originated in genetic algorithms (Holland, 1975), update the population through 


a Metropolis-Hastings algorithm that operates on the population space and constructs the proposals 
by using information from different population chains. Essentially, the distributed information across 
the population is used to guide further simulations. This allows information among different chains 
of the population to be exchanged in order to improve mixing. As a result, crossover operations can 


facilitate the exploration of the sample space. Crossover operations include the fc-point (Liang 2011 


Liang and Wong 2001 20001, snooker (Liang, 2011 Liang and Wong 2001 2000, Gilks et al. 1994), 


linear (Liang 2011 Liang and Wong 2001 2000, Gilks et al. 1994) crossover operations etc. 


The weight update aims at estimating {0^^} by using a mean field approximation at each iteration 
with the step size controlled by the gain factor. It is performed by using all the population of chains: At 
iteration t, the update of is performed as 9' = 9t-i + where = 

- tt], := = 1 : m), and ^ li'Li e Ej), for 

j = l,...,m. Intuitively, because all the population chains share the same partition £ and bias weights 
{ 0 , 1 ,}, and the population individuals are independent and identically distributed, the indicator functions of 
Pt (used in Algorithm can be replaced here by the proportion p^^^ of the population in the associated 
subregions at each iteration. Namely, the indicator functions of pt (in Algorithm]^ is replaced by the law of 
the MCMC chain associated with the current parameter. A theoretical analysis in Appendixshows that 
the multiple-chain weight update (in Algorithm is asymptotically more efficient that the single-chain one 
(in Algorithm]^. 

The truncation step applies a truncation on 9t to ensure that 9t lies in a compact set 0 as in SAA; hence 
we consider quantities 6q, {Me}, and {ct} as in Section]^ 

The proposed algorithm works as follows: At iteration t, we assume that the Markov chain is at 


Parallel and Interacting Stochastic Approximation Annealing algorithms for global optimisatic 
Georgios Karagiannis, Bledar A. Konomi, Guang Lin, and Faming Liang 



































































































Algorithm 3 Parallel and interacting stochastic approximation annealing algorithm 
Requires : Insert {r*}, { 7 *}, {Ej}, {tTj}, {MJ, k, Oq 

Initialise : At t = 0, set e 9q e 0, snch that || 0 o ||2 < Mq, and cq = 0. 
Iterate : For t = 1, 


1. Sampling update: 

Simulate from the Metropolis-Hastings transition probability P 0 '^\ £) that tar¬ 

get 

2. Weight update: 

Compute e' = 6 » 4 _i where -tt], := = 

1 : m), and ^ ^ Ej), for j = 1, 

3. Truncation step: 

Set 6 t = 9', and ct = Ct-i if ^ or set 9t = 9 q, and c* = Ct_i + 1 if otherwise. 


state with a working estimate 9t-i. Firstly, simulate a population sample from the Markov 

transition probability Pg'^^^ ^^{x[]^i\d-;£) . Secondly, update the working estimate 9t according to 9' = 
9t-i + 7 tFI.fr^( 6 »t_i,a;|^'”^), where H^rf’{9t-i,x\^''^'’) = - tt], := {pi'"'^\j = 1 : m), and = 

K ljr=i e Ej), for j = l,...,m, by using the whole population {x[^''^^}. Thirdly, if || 0 '^aO ||2 ^ 

truncate such that 9t = 9 q, and c* = Ct_i + 1. At the end of the run, t = n, it is possible to apply a 
^-normalisation step (see Section -an alternative ^-normalisation step can be 9n ^ <— 9n ^ + z, where 
z = — log(2”l]^ TTj exp(0l^^)). PISAA is summarised as a pseudo-code in Algorithmj^ A more rigorous ana¬ 
lysis about the convergence and the stability of PISAA is given in Appendix and summarised in Section 

ixn 


3.2 Theoretical analysis: a synopsis 

Regarding the convergence of the proposed algorithm, PISAA inherits a number of desirable theoretical 
results from SAA (Liang et al. 20141 and pop-SAMC ( [Song et'ah 20141. A brief theoretical analysis related 
to the convergence of PISAA is included in Appendix where we show that theoretical results of |Song| 
|et al.|20T^for pop-SAMC hold in the PISAA framework as well, and we present theoretical results in |Liang| 


et al. (20141 for SAA that hold for PISAA as well. The Theorems A.l A.2 A.4 and A.5 as well as related 


conditions on PISAA, are included in the Appendix [ a} We recall, the temperature ladder: Tt = ti/Vt + t^, 
ti > 0, the gain function: 74 = , to > 0, f3 e (0.5,1), and consider that := 

denotes a draw from PISAA at the t-th iteration. 
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PISAA can achieve for any individual the following convergence result: For any e > 0, as t 

^ 0 

P(;7(A«) ^ + e| J(A«) = j) - 1, a.s., 


00 , and 


where J{x) = j if x e Ej, and u* = minj;^^;^ U{x), for j = 1, Namely, as the number of iterations 

t becomes large, PISAA is able to locate the minima of each subregion in a single run if is small. This 
comes as a consequence of |Liang et al. (2014 Corollary 3.1) and the Theorems A.l and A.2| in Appendix [A| 
Theorem A.l in Appendix [A| (a restatement of Theorems 3.1 and 3.2 of Liang et al. ( |2014[ |) indicates that 
the weights {0t} remain in a compact subset of 0 and hence 0,,= = {0^^;j = 1,..., m) can be expressed in the 
form = c + log(^^ exp(—C/(a;(*))/r,i=)da;(*)) — log(7rj), for j = 1, ...m, and any i = 1 ,..., k, where c e K 
is an arbitrary constant. Namely, as t ^ oo, 1'^) 1'^)’ since |£) 

is invariant to transformations 0 <— 0 + c. Furthermore, Theorem |A.2| in Appendix (a restatement of 
Theorem 3.3 of ^ 


Liang et al. 


(2014l) implies that ~ lif), in a SLLN fashion; where A, 


(1:k) 

t+1 


a draw from PISAA at the {t + l)-th iteration. 


It is not trivial to show that the results of (Song et al. 20141 for pop-SAMC hold in the PISAA framework 
as well. The reason is that, unlike in pop-SAMC, in the PISAA framework the target distribution is 
parametrised by an additional control parameter the temperature ladder {tj}, and hence the density of the 
target distribution changes at each iteration. I.e. ^ if t A t' in the PISAA framework. 

In Appendix [A| Lemma |A.3| considers the decomposition of the noise in the PISAA framework, and allows 


us to be able to extent the main theoretical results of (Song et al. 20141 to the PISAA framework as stated 
in Theorems A.4 and A.5 in the Appendix Theorem A.4 implies that the weights {6t} generated by 
PISAA are asymptotically distributed according to the Gaussian distribution, and constitutes an extension 
of (Theorem 2, Song et aT] 2014[ | in the PISAA framework. Theorem A.5 considers the relative efficiency 
of the bias weight estimate {0f} generated by the self-adjusting mechanism of the multiple-chain PISAA 
(with population size k) at iteration t, against estimate {0%t} generated by the self-adjusting mechanism of 
the single-chain SAA at iteration k ■ t. Theorem A.5 implies that (0^ — 0if )/and (0^( — 0 ^)follow 
the same distribution asymptotically with convergence rate ratio where /? e (0.5,1], and hence is the 


extension of (Theorem 4, Song et al., 20141 in the PISAA framework. 

In other words, when /3 < 1, the multiple-chain PISAA estimator of the bias weights is asymptotically 
more efficient than that of the single-chain SAA; while when /3 = 1, the two estimators present similar 
efficiency. In practice, PISAA estimator is expected to outperform the single-chain SAA estimator even 
when /3 = 1 because of the so called population effect; the use of multiple-chains to explore the sampling 


space and approximate the unknown {0t}- Theorem A.5 implies rigorously that the adjustment process in 
PISAA is more stable than that in SAA. 
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3.3 Practical implementation and remarks 


Liang et al. (2014[) discussed several practical issues on the implementation of SAA (including the algorithmic 


settings { 7 *}, {t*} {M^}, n ) that are still applicable to PISAA. Here, we adopt these algorithmic 

settings, i.e.: tt^oc exp(-A(j-l)) with ( > 0; 7 * = with /3 e (0.5,1]; n = 

where r,,, > 0, t/j > 0, and > 0; and Me = 10^°Mc_i with Mg = 10^°°. We briefly discuss additional 
practical details of PISAA: 


• The population seed controls the initialisation of the population. It is preferable, but not ne¬ 

cessary, for the population of chains to initiate from various locations, possibly around different local 
minima. This could benefit the exploration of the space and the search for the global minimum. This 
can be achieved, for example, by sampling from a flat distribution e.g. fro{x)<xexjp(—U{x)/TQ), with 
To > 0 large enough, via a random walk Metropolis algorithm. 


The MCMC operations must result in reasonable expected acceptance probabilities because they can 
affect the sampling update. It is possible to calibrate the scale parameter of the proposals adaptively 


(on-the-ffy) by using an adaptation scheme (Andrieu and Thoms 20081, during the first few iterations. 


• The rates of the operations in the MCMC sweep at each iteration are problem dependent. One may 
favour specific operations by increasing the corresponding rates if it is believed that they are more 
effective or cheaper to run for the particular application. 


PISAA can be modified to deal with empty subregions similar to SAA. Let St denote the set of non-empty 

O rt 

subregions until iteration t, Ot * denote the sub-vector of 0t corresponding to elements of St, and 0 ** denote 
the sub-space of 0 corresponding to elements of St- Yet, let yC-") denote the proposed population value 
generated during the sampling update, and J{x) = j if x e Ej. Then Algorithm]^ can be modified as follows: 


(Sampling update): Simulate 
{J( 2 /W); i = 1, 


r)(^) 


{x^tS!^ (as in Algorithm 


31, and set St <— St-i u 


• (Weight update): Compute +-ttHi'^\e^^\,x^t'-^'^), for j e S^*\ 


• (Truncation step): Set 9t = 6', and ct = Ct-i if || 0^‘^*||2 ^ ^ct, or set 9t = 9q, and Ct = Ct-i -I- 1 if 
otherwise. 


This modification ensures {9t} to remain in a compact set. Note that the desired sampling distribution 
becomes actually + 7 re;for j = 1 : m}, and 

f 

C-f log(^ exp(-C/(a;)/TH=)da;) - log(7ri 4- TTe) , if 7 ^ 0 
~9f , if A, = 0 

where Te = 0 / ||5'oo||, and Soo is the limiting set of St- 


12 


Parallel and Interacting Stochastic Approximation Annealing algorithms for global optimisation 
Georgios Karagiannis, Bledar A. Konomi, Guang Lin, and Faming Liang 


















For population size k = 1, PISAA is identical to the single-chain SAA. 

PISAA can be used, in the same spirit as the tempered transitions ( Neal[ 19961, for sampling from a 
multi-modal distribution /(d-). One can run PISAA with U{x) := — log(/(a:)), 

Th > 1, r,|; = 1, and collect the sample Xn ' ■ Then, inference can be performed by importance sampling 
methods due to Theorems | A. 1 1 and | A.2 1 in Appendix [A] 


4 Applications 

We compare the performance of PISAA with those of other stochastic optimisation procedures such as the 
simulated annealing (SA) ( Kirkpatrick et ahj 19831, very fast simulated re-annealing (VFSA) (Ingber, 1989[ 
Sen and Stoffa 1996[ Jackson et al. 20041, annealing stochastic approximation Monte Carlo (ASAMC) 
(Liang, 20071, annealing evolutionary stochastic approximation Monte Carlo (AESAMC) (Liang, 2011[| , and 


stochastic approximation annealing (SAA) (Liang et al., 20141. 

As a performance measure, we consider the average best function value discovered by the algorithm. 
We perform 48 independent realisations for each simulation, and average out the values of the performance 
measures, in order to eliminate nuisance variation in the output of the algorithms (caused by their stochastic 
nature or random seeds). To monitor the convergence of PISAA and the stability of its self-adjusting 

, where 


mechanism, we consider the MSE of the bias weights as in (Song et al. 


Wt := = 1 : m), := ^U{x)dx are the real values of the bias weights, and 9\'^’ are the 

estimates of Wt approximated by the self-adjusting mechanism of PISAA with population size k. 

The mutation operations and crossover operations, used in the examples, are presented in Appendix [B] 
as pseudo-codes. 


„(i) 


20141 MSE := 


Oi - Wt 
Jk) 


4.1 Gaussian mixture model 

We consider the Gaussian mixture with density 

20 

fi{x) = ^ zUi'N 2 {x\fj.i, cr^)l(x G X), (4.1) 

2=1 

where x e A = [—10^°, 10^°]^, cr^ = 0.001, {rui = 1/20; i = 1 : 20}, and are given in (Table 1 in 


Liang and Wong 20011. Sampling from (4.11 is challenging because this distribution is multi-modal and has 


several isolated modes. Here, our purpose is to check the validity of PISAA instead of optimisation. 

We consider default algorithmic settings for PISAA: (i) energy function Ui{x) = — log{fi{x)), (ii) uni¬ 
formly spaced grid {uj} with m = 19, ui = 0, and uig = 9.0, (hi) gain factor {yd with = 100, /3 = 0.55, 
(iv) temperature sequence {rt} with = 1, = 5, and r,,, = l — Th^/^Jn, and (v) MCMC transition prob¬ 

ability that uses mutation operations (Metropolis, hit-and-run, /c-point) and crossover operations (A:-point, 
snooker, linear), with equal operation rates, and proposal scales calibrated so that the expected acceptance 
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probabilities to be around 0.234. At the end of the simulation, at iteration n = 10®, the temperature will be 
T„ = 1; and hence one may see this example as tempered transition sampling from multi-modal distribution 
/i(d-) via PISAA. 

We run PISAA with different combinations of population size k e {1,...,30} and gain factor power 
/3 e {0.55,0.65,0.75,0.85,0.95,1.0}. Each of these runs was repeated for 100 realisations in order to com¬ 


pute the estimates, error bars, mean square error MSE := 


Oi ^ - Wt 


, and relative efficiency RE(k; f3) := 


(k) 

[h/kJ 


( 1 ) 


of the bias weights. Note, that the bias weights are estimated by the self-adjusting 

(kj) 


mechanism of PISAA using the ^-normalisation step exp(0n + z) = 1. 


Figure 


4.1a 


presents the estimates of the bias weights for j = 1,..., 6, and n = 10®, as produced by 


the self-adjusting mechanism of PISAA with different population sizes n e {1,10, 30}. We observe that the 
{9^’^^} of PISAA have converged to the true values at any of the population sizes considered, and that the 
associated error bars are narrower for larger population sizes. Figure |4. lb| presents the MSEs produced by 
PISAA at different iteration steps, and for different population sizes. We observe that PISAA with larger 
population sizes has produced smaller MSEs throughout the whole simulation time. Yet, MSE decays as 
the iterations evolve; this behaviour, although not surprising, may be non-trivial due to the heterogeneous 


nature of the sequence {wt} (that is Wt A Wf, for t ^ t'). Figure 4.1c presents the progression of the MSEs 
produced by PISAA for different gain factor powers. We observe that MSE decreases when the population 
size increases. Furthermore, we observe that this behaviour is more significant for slower decaying gain 
factors -namely when the power of the gain factor is smaller and close to 0.5. 


Figure 4.Id presents the relative efficiency RE(k; /?) of the self-adjusting process estimator for the biased 
weights wif: as a, function of the population size k e {2,..., 30}, and for different powers of gain factors 
P e {0.55,0.65,0.75,0.85,0.95,1}. In serial computing environments, the computational cost can be defined 
as the iterations times the population size. For the computation of relative efficiency RE(k; /3), we considered 
constant computational cost, and hence PISAA with population size k ran for [u/kJ iterations, where n = 10®. 
In Figure [4.1d| the marks refer to the estimated relative efficiency, the dashed lines are lines with slop /3 — 1 
and refer to the theoretical behaviour of the relative efficiency (i.e. lg(RE(K;/3)) (/3 — l)lg(/c)) from 

Theorem |A.5| while the different colours correspond to different values of /3. We observe that the empirical 
results are consistent with Theorem |A.5| since the marks lie close to their corresponding lines. The efficiency 
of the estimates of the bias weights produced by the self-adjusting mechanism of PISAA improves as k 
increases. Thus, increasing the population size improves the stability of PISAA even in the case that a serial 
computing environment is used and a fixed computational budget is given. We observe that this behaviour 
is even more significant for slower decaying gain factors. 

The results support that, PISAA produces the ‘real’ estimates for w* as rj ^ r*, the MSE of those 
estimates reduces as k increases, and the efficiency of the self-adjusting mechanism improves as k increases. 
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(a) Estimate and 95% error bars of 
= 1 : 6} (at n = 10®). 


Pop. size 



(b) Progression curves of the MSE es¬ 
timate of for different popula¬ 

tion sizes. (/3 = 0.55). 
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Gain factor 
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0 0 = 0.75(sim) 
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. /3 = 0.95(sim) 
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0 = l(sim) 
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(c) MSE against the population size, 
for different power /3 of gain factor (at 
n - 10®). 


(d) Relative efficiency of the bias 
weight estimate as a function of the 
population size, for different power of 
gain factor. 


Figure 4.1: 


(Section 4.11 Estimates, MSEs, and relative efficiency of the bias weights {wn^ 


by PISAA at different iteration, population sizes, and power of gain factor /3. 
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4.2 Rastrigin’s function 

We test the proposed algorithm on a benchmark optimisation problem where the goal is to minimise the 
rotated Rastrigin’s function U 2 {-) 


U 2 {x) = Ra.{y{x)); 

(4.2) 

d 

Ra(2/) = lOd + V {yl - 10 cos(27ryfc); 

(4.3) 

k = l 


y{x) = Rx, 

(4.4) 


e X, on space X = [—5.12,5.12]'^, d g N — {0}, where Ra : X 


is the Rastrigin’s function (Torn 


and Zilinskas 1989, Miihlenbein et ah] 1991[ Liang, 20111, and i? is a rotation matrix generated according 


to the Salomon’s method, see details in (Appendix B in Salomon, 19961. The global minimum of (4.21 is 
Ra(a:,i=) = 0 at a;,,, = (0,..., 0), for d G N — {0} ( [Miihlenbein et ah 19911. 

Rastrigin’s function has been used by several researchers as a hard benchmark function to test experi¬ 


mental optimisation algorithms (Dieterich and Hartke 2012 Torn and Zilinskas, 1989 Miihlenbein et al. 


1991 Liang 2011 Liang et al. 2006 Ali et al. 20051. It presents features that can complicate the search 


for the global minimum: it is non-convex, non-linear, relatively flat and presents several local minima that 


increase with dimension; e.g. about 50 local minima for d = 2 (Ali et al. 20051. The rotation transformation 


(4.4) is a well established technique that transforms originally separable test functions, such as the Rastri¬ 
gin’s one, into non-separable. Non-separability makes the optimisation task even harder by preventing the 
optimisation of a multidimensional function to be reduced into many separate lower-dimensional optimisa¬ 


tion tastks. For instance, in (4.4), all the dimensions in vector y are affected when one dimension in vector 
X changes in value. 

Here, if not stated otherwise, we consider default settings for PISAA: (i) n = 10® iterations, (ii) uniformly 
spaced grid {uj} with m = 400, ui = —0.01, M 400 = 40, (iii) desirable probability with parameter A = 0.1, 
(iv) temperature ladder {tj} with Th = 1, = 1, r,,, = 10“^, (iv) gain factor {yt} with = 10®, = 0.55. 

One MCMC sweep is considered to be a random scan of mutation operations (Metropolis, hit-and-run, k- 
point) and crossover operations (fc-point, snooker, linear), with equal operation rates, and scale parameters 
calibrated so that the expected acceptance ratio to be around 0.234. 


In Figure 4.2a we present the average progression curves of the best function value (best value), discovered 
by PISAA for different population sizes kg {1,4, 5,14, 30}. We observe that by using larger population sizes, 
the algorithm quicker discovers smaller best values, and quicker converges towards the global minimum. The 
difference in performance between SAA (aka PISAA with k = 1) and PISAA using a moderate population 
size, such as k = 5, is significant. In Figure |4.2b| we plot the best value against the population size for 
different dimensionality of the Rastrigin’s function. We observe that PISAA discovers smaller best values as 
the population size increases for the same number of iterations. Increasing the population size improves the 
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performance of the algorithm significantly at any dimensionality considered, while it is particularly effective 
in large or moderate dimensionalities. We highlight that the most striking performance improvement is 
observed in the range of small population sizes. In Figure |4.2c[ we observe that the MSE of the bias weights 
approximated by the self-adjusting mechanism of PISAA becomes smaller when larger population sizes are 
used. This indicates that increasing the population size makes the self-adjusting mechanism of PISAA more 
stable. 

The performance of PISAA with respect to the grid size, for different desired probabilities, is presented 


in Figure 4.2d In particular, we ran PISAA with a large enough population size {k = 30) to ensure that all 
the subregions are visited. We observe that larger grid sizes lead to a better performance for PISAA, given 
that the population size is large enough. We observe that the choice of the desired probability has bigger 
impact for large grid sizes {m > 50) than for smaller grid sizes (m < 50). However, for any grid size, we 
observe that a moderately biased desired distribution (A as 0.1, ...,0.9) is preferable. The performance of 


PISAA against the population size for different desired probabilities is presented in Figure 4.2e We observe 
that increasing the population size is more effective for desired probabilities with (A as 0.1,..., 0.9). Hence, 
although biasing towards low energy subregions is preferable for optimisation problems, over-biasing can slow 
down the convergence towards the global minimum. Figure [4. 2f| presents the performance of PISAA against 
the population size for different grid sizes. The performance improvement of PISAA due to the population 
size increase becomes more significant when finer grids (larger grid sizes) are used. As mentioned, finer 
grids improve the exploration of the sampling space, however they require a more efficient self-adjusting 
mechanism to fight against possible larger variance in the approximation of {Ot} due to the increased number 
of subregions. Here, we observed that increasing the population size allows the use of finer grids, while it 
reduces the aforesaid consequence. 

We compare PISAA with VFSA using the same operations and temperature ladder as PISAA, and with 


AESAMC using the settings used by (Liang 20111, against the SOD Rastrigin’s function. In Figures 4.3a and 
|4.3b[ we plot the average progression curves of the best values discovered by each algorithm for population 
sizes K = 5, and 14 respectively. We observe that PISAA converges quicker to global minimum than VFSA 
and AESAMC in both cases. Figure [4.3c| presents the performance of the algorithms against the population 
size. We observe that increasing the population size improves the performance of PISAA, in terms of average 
best values discovered, significantly faster than the performance of VFSA and AESAMC. It is observed that, 
although the population size increases, VFSA and AESAMC stop improving after k = 10, while PISAA 
continues to improve even after k > 10 but at a slower rate. This is because the underline adjustment 
process of {wt} keeps on improving, in terms of variance, and converges faster as k increases. Therefore, 
PISAA outperforms significantly VFSA, and AESAMC. 
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(a) Average progression curves of the 
best function values, for different 
population sizes. 


(b) Average best function values 
against the population size, for dif¬ 
ferent dimensionality. 


(c) Progression curves of the MSE 
estimate of wt, in Ig-scale, for differ¬ 
ent population sizes. (0 = 0.55). 



(d) Average best function values as 
functions of the grid size, for differ¬ 
ent values A of the desired distribu¬ 
tion. 



(e) Average best function values as 
functions of the population size, for 
different values A of the desired dis¬ 
tribution. 



(f) Average best function values as 
functions of the population size, for 
different grid sizes. 


Figure 4.2: 


(Section 4.21 Performance plots of PISAA. The results reported consider averaged values 


over 48 independent runs. 
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(a) Average progression curves of the 
best function values generated by 
PISAA, AESAMC, and VFSA with 
population size 5. 


(b) Average progression curves of the 
best function values generated by 
PISAA, AESAMC, and VFSA with 
population size 14. 


(c) Average best function values gen¬ 
erated by PISAA, AESAMC, and 
VFSA against the population size. 


Figure 4.3: (Section 4.21 Average best values (averaged over 48 independent runs) discovered by PISAA, 


AESAMC, and VFSA. 


4.3 Protein folding 

Proteins are essential to the living organisms as they can carry out a multitude of biological processes, 
e.g. production of enzymes, antibodies etc. In biophysics, understanding the protein folding mechanism is 
important because the native conformation of a protein strongly determines its biological function. Predicting 
the native conformation of a protein from its sequence can be treated as an optimisation problem that involves 
finding the coordinates of atoms so that the potential energy of the protein is minimised. This is a challenging 


optimisation problem (Liang 20041, because (i) the dimensionality of the system is usually high, and (ii) 
the landscape of the potential energy is rugged and characterised by a multitude of local energy minima 
separated by high energy barriers. 

To understand the relevant mechanics of protein folding, simplified, but still non-trivial, theoretical 


protein models exist; among them is the off-lattice AB protein model (Stillinger et al. 19931. The off-lattice 
AB protein model incorporates only two types of monomers A and B, in place of the 20 that occur naturally, 
which have hydrophobic and hydrophilic behaviours respectively. The atom sequence Si, i e {2,3,...}, of 


a Ni-mer, can be determined by a Fibonacci sequence (Stillinger et al. 1993 Stillinger and Head-Gordon 


1995, Hsu et al. 20031 which is defined recursively as Sq = A, Si = B, Si = Si- 2 Si-i and has length given 


by the Fibonacci number Ni = Ni -2 + > 2. The atoms are assumed to be linked consecutively by 

rigid bonds of unit length to form a linear chain which can bend continuously between any pair of successive 
links. The chain can reside in the 2-, or 3- dimensional physical space which defines the 2D, or 3D off-lattice 
AB model, correspondingly. 


For the 2D AB model (Stillinger et al. 1993 Stillinger and Head-Gordon 1995 Liang 20041, the potential 
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energy is 


jSf-2 N-2 N 

U3.Ae2..N-l) = I] Ve{i)+ Y, E 

i = l 2=1 j = i-\-2 


(4.5) 


Vg{i) := 0.25(1 - u] ■ Ui+i),VLj{i, j) := 4(r,_y - C' 2 D(i, ), 

where C' 2 D(*,j) is 1, 1/2, and —1/2, for AA, BB, and AB pairs respectively, Ui := (cos(0i),sin(0i))‘'' is the 
nnit vector joining monomer i to monomer i + 1, Vij := rij( 62 -.N-i) denotes the distance between monomers 
i and j, and 9i = 0, e [0, 27r), for i = 2 ,..., N— 1, are polar coordinates. The dimensionality of the problem 


is d = N — 2. For the 3D AB model (Irback et al., 1997 Hsn et al., 2003 Bachmann et al. 2005 Kim et al. 


2005, Liang 20041, the potential energy is 

UsA^^-N-iAs-.N-i) = E E E E 


N-2 


N-3 


N-2 N 


(4.6) 


2=1 j=i+2 


Vs{i) := Ui ■ Ui+i, Vr(i) := -0.5(ui • Mi+ 2 ), VLj{i,j) := 4(r-J - C'3D(i, j’), 


■ 


where C' 3 D(*,j) is 1, for AA, and 1/2, for BB, and AB pairs, m := (cos(0i) sin(0i), sin(0i) sin((/j), cos((/i))‘'', 
9i is the azimnthal angle, and is the polar angle of Ui such that = (/i = 1/2 = 0, 9i e [0, 27r), (j>i e [0, tt], 
for 1 = 1 ,..., A^— 1. The dimensionality of the problem is d = 2N — 5. Here, for the purpose of demonstration, 
we concentrate on the 13-, 21~,34-, and 55- mers AB. 

We consider default settings for PISAA (valid if not stated otherwise): (i) n = 2 • 10^ iterations, (ii) 
uniformly spaced grid {uj} with m = 101, (iii) desirable probability with parameter A = 0.1, (iv) temperature 
ladder {tj} with = 10, = 10®, = 10“^, (iv) gain factor { 7 t} with = 10^, /3 = 0.55, and (v) 

MCMC transition probability with mutation operations (Metropolis, hit-and-run, fc-point) and crossover 
operations (/c-point, snooker, linear), equal operation rates, and operation scale parameters cr//(m 1 ), 
where a is calibrated so that the expected acceptance ratio to be around 0.234, and j is the label of the 
subregion the current state belongs to. Each experiment ran 48 times independently to eliminate possible 
variation in the output caused by nuisance factors. 

We examine the performance of PISAA as a function of the iterations and the population size. In Figures 
|4.4a| and |4.4d[ we illustrate the average progressive curves of the best values discovered by PISAA using 
different population sizes against the 55-mer 2D and 3D AB models. We observe that PISAA using larger 
population sizes converges quicker towards smaller average best values. In Figures [4.4b| and |4.4e[ we present 
the performance of PISAA with respect to the ‘best values’ discovered until the iteration n = 2 ■ 10^ as 
a function of the population size against the 2D and 3D AB models, respectively. In our simulations, we 
have considered the 13-, 21-,34-, and 55-mers AB sequences. We observe that increasing the population 
size of PISAA is particularly effective in longer AB sequences (and so higher in dimension problems), while 
moderate population sizes are adequate in shorter AB sequences (and so moderate in dimension problems). 
In fact, PISAA improves significantly as the population size increases in the high dimensional case of 55- 
mers, however it performs acceptably even with a moderate population size {k ^ 10 ) in the lower dimensional 
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(a) Average progression curves of the 
best function values, for different 
population sizes. 


(b) Average best function values 
against the population size, for dif¬ 
ferent lengths of polymer. 


(c) Progression curves of the MSE 
estimate of wt for different popula¬ 
tion sizes. (/3 = 0.55). 



(d) Average progression curves of the 
best function values, for different 
population size. 


(e) Average best function values 
against the population size, for dif¬ 
ferent lengths of polymer. 


(f) Progression curves of the MSE es¬ 
timate of Wt for different population 
sizes. (/3 ^ 0.55). 


Figure 4.4: (Section 4.31 Performance plots of PISAA. The results reported consider averaged values 
over 48 independent runs. The 1st and 2nd rows refer to the 2D and 3D AB models correspondingly. 


cases of 13-, 21-,34-mers. Compared to the standard SAA (aka PISAA with k = 1), PISAA (with /t > 1) 
presents significantly improved performance, in the 55-mer 2D and 3D AB models when the same number of 
iterations is considered. Note that increasing the population size of PISAA does not necessarily mean that 
the CPU time required for the algorithm to run increases significantly because PISAA can be implemented in 
parallel computational environment if available. In Figures [4.4c| and |4.4f| we observe that when PISAA uses 
larger population sizes, the bias weights generated by the self-adjusting mechanism of PISAA have smaller 
MSE, and hence the algorithm tends to present a more stable self-adjusting process. 

We compare the performance of PISAA with those of VFSA and AESAMC, against the 55-mer 2D, and 
3D off-lattice AB models. We run each simulation 48 times independently to eliminate possible variation 
in the output caused by nuisance factors. About the algorithmic settings: PISAA uses the aforementioned 
settings, VFSA shares common settings with PISAA, and AESAMC uses an equally spaced partition of 10^ 
subregions, temperature t = 1.0, and threshold values H = 10. VFSA and AESAMC use the same crossover 
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and mutation operations as PISAA. 

The results from the empirical comparison of PISAA, AESAMC, and VFSA associated to the 2D and 
3D AB models are summarised in the 1st and 2nd rows of Figure |4.5| respectively. Figures |4.5a[ |4.5b[ |4.5d| 


and 4.5e show the average progression curves, up to iteration n = 10^, generated by the algorithms under 
comparison. We observe that the average progression curves generated by PISAA converge quicker towards 
smaller ‘best values’ than those generated by AFSAMC, and VFSA. This behaviour is observed in both large 
population sizes {k = 30) and small population sizes (k = 3 and k = 10), in 2D and 3D AB models. PISAA 
does not appear to become trapped into local minima although, during the first iteration steps, the curves 
generated by PISAA reduce at a faster rate than those generated by AESAMC and VFSA. Possibly, the 
reason is because compared to AESAMC, PISAA uses a smoother shrink strategy towards areas of minima, 
while compared to VFSA, PISAA uses an enhanced self-adjusting mechanism. 

In Figures [4.5c| and |4.5f| we compare the performance of PISAA, AESAMC, and VFSA with respect to 
the averaged best values discovered as a function of the population size, in the 2D and 3D AB models. We 
observe that PISAA has discovered smaller ‘best values’ than AESAMC and VFSA for any population size 
considered in both 2D, and 3D AB models. However, if parallel environment is available, PISAA is expected 
to further outperform AESAMC, for a given budget of execution time, because at each iteration PISAA can 
generate the population simultaneously by using several CPU cores in parallel while AESAMC has to do it 
serially. Thus, we observe that PISAA significantly outperforms AESAMC and VFSA. 


4.4 Spatial imaging 

We consider an image restoration problem where there is need to remove the noise from a 2D binary image. 
The image under consideration was obtained from PNNL’s project supported by the U.S. Department of 
Energy’s Office of Energy Efficiency and Renewable Energy to improve advanced transportation technologies. 


The image is a gray-scale photo-micrograph of the micro-structure of the Ferrite-Pearlite steel (Figure 4.6a I, 
where the lighter part is ferrite while the darker part is pearlite. It can help us investigate how the micrograph 
of the microstructure of the Ferrite-Pearlite steel (and hence strength level) develops during hot rolling 


(Gladshtein et al. 20121, and therefore better understand how to control the strength of a strip steel. We 


focus our analysis on the first quarter fragment of size 240 x 320 pixels (red frame in Figure 4.6a[ ). Since the 
image is contaminated by noise, our purpose is to restore the original image x given the degraded (observed) 
image y. 

We employ the Bayesian image restoration model of (Besag, 1977 Geman and Geman, 1984[ Besag 19861 


which is based on the Ising model (Ising, 19251 and has posterior distribution with density 7r(a;|y) such that 

(4.7) 

where a > 0, and & > 0 are fixed parameters (here, a = 1.1, and b = 0.9). The symbol ‘~’ denotes the 


7T{x\y)cxiexp{a^l{y.}{xi) + b ^ l{3;^.}(a;j)). 
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(a) Average progression curves of the 
best function values discovered by 
PISAA, AESAMC, and VFSA with 
population size 3. 


(b) Average progression curves of the 
best function values discovered by 
PISAA, AESAMC, and VFSA with 
population size 30. 


(c) Average best function values dis¬ 
covered by PISAA, AESAMC, and 
VFSA against the population size 
(n = 2 ■ 10^). 


3D off lattice AB protein model (55-mer) 
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3D off lattice AB protein model (55-mer) 
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(d) Average progression curves of the 
best function values discovered by 
PISAA, AESAMC, and VFSA with 
population size 10. 


(e) Average progression curves of the 
best function values discovered by 
PISAA, AESAMC, and VFSA with 
population size 30. 


(f) Average best function values gen¬ 
erated by PISAA, AESAMC, and 
VFSA against the population size 
(n - 2 - 10^). 


Figure 4.5: (Section 4.31 Average best values (averaged over 48 independent runs) discovered by PISAA, 
AESAMC, and VFSA. We consider the 55-mer AB model in 2D and 3D, 1st and 2nd rows correspond- 
ingly. 
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(a) Gray scale image (480 x 640 pixels), and the (b) MAP estimate of image fragment (240 x 320 
fragment under consideration (240 x 320 pixels) pixels) 


Figure 4.6: (Section 4.41 Gray scale digital photo-micrograph of the micro-structure of the Ferrite- 


Pearlite steel, and the MAP estimate of its, red in colour, framed fragment. 


neighbourhood of the eight adjacencies (vertical, horizontal, and diagonal) of each interior pixel. In Eq. 4.7 


the first term is associated to the likelihood and encourages states Xi to be identical to the observed pixel j/i, 
while the second term is associated to the Ising prior model, encourages neighbouring pixels to be equal and 
hence provides smoothing. In this context, image restoration can be achieved by computing the maximum a 
posteriori (MAP) estimate of the original image which can be found by minimising the negative log posterior 
density, Ua{x) := — log(7r(a:|2/)) ( [Gernan and Gemaii 19841. 

Gomputational difficulties raise when algorithms based on standard MGMG samplers with component¬ 


wise structure of single-pixel updates are employed (Higdon, 19981. Such an update design tends to either 


converge slow or get trapped because the prior term in (4.71 strongly prefers large blocks of pixels. This issue 


becomes even more serious for large values of b which favour strong dependencies. Against this application, we 
compare the PISAA, VFSA, and PSAA. PSAA refers to the parallel SAA, a multiple-chain implementation 
of SAA that involves running a number of standard SAA procedures with the same algorithmic settings 
in parallel and completely independently. PISAA and PSAA use the following algorithmic settings: (i) 
n = 5 ■ 10^ iterations, (ii) uniformly spaced grid {uj} with m = 200, ui = —826315.5, uioo = —971500.5, 
(hi) desirable probability with parameter A = 0.1, (iv) temperature ladder {tj} with t/j = 5, = 10^, 

Tif = 10“^, (iv) gain factor {yd with = 10^, f3 = 0.55. The MGMG kernel of PISAA is designed to be 
a random scan of a Gibbs update (updating one pixel at a time) and fc-point crossover operations (where 
k = 2). VFSA and SAA use only Gibbs updates. 

We observe that PISAA discovers quicker smaller best values when the population size increases (Figures 


4.7a and 4.7c I. Figure 4.7c shows that PISAA converges quicker than PSAA as the number of the parallel 


chains involved increases. This implies that it is preferable to run a PISAA with a population size At > 1 
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(a) Average progression curves of 
the best function values discovered 
by PISAA with different population 
sizes. 


(b) Average progression curves of the 
best function values discovered by 
PISAA, PSAA, and VFSA with pop¬ 
ulation size 20. 


(c) Average best function values gen¬ 
erated by PISAA, PSAA, and VFSA 
against the population size. 


Figure 4.7: (Section 


4.41 Performance and comparison plots of PISAA, PSAA, and VFSA. 


rather than run k SAA procedures completely independent from each other. Moreover, it shows that the 
interacting character of PISAA is a necessary ingredient for significantly improving the performance of the 
algorithm by increasing the population size. By ‘interacting character’ of PISAA, we refer to the distinctive 
way that the crossover operations and self-adjusting mechanism of PISAA use the distributed information 
gained from all the population chains to operate. Moreover, we observe that PISAA outperforms VFSA 
(Figures |4.7b and 4.7c I. Finally, the MAP estimate of the original image as computed by PISAA with 
population size 30 is shown in Figure pi. 6b[ 


4.5 Bayesian network learning 

The Bayesian network ( Ellis and Wong[ |2008 1 is a directed acyclic graph (DAG) whose nodes represent 
variables in the domain, and edges correspond to direct probabilistic dependencies between them. It is 
a powerful knowledge representation and reasoning tool under conditions of uncertainty that is typical of 
real-life applications. Mathematically, it can be defined as a pair B = {G,p), where Q = (V,£) is a DAG 
representing the structure of the network, V denotes the set of nodes, £ denotes the set of edges, and p is 
the vector of the associated conditional probabilities. In the discrete case we consider here, V := {V; z = 1 : 
d} e V denotes a node that takes values in a finite set {vj]j = 1 : rz}, e N — {0} and hence V is assumed 
to be a categorical variable. Therefore, there are qi = Oy epa(y) G possible values for the joint state of the 
parents of V, where pa(Vi) denotes the set of parents of Vi node. In this example, we consider the prior 


model of Ellis and Wong (20081; Liang and Zhang (20091, and hence we focus our interest in the marginal 


posterior probability Pr(5|I?) such that 

b 




(4.8) 
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where V = {Vi]i = 1 : N) denotes the data set, considered to be IID samples, denotes the number 


of samples for which Vi is in state j and pa,{Vi) is in state k, = {i’iqi)~^ (Ellis and Wong 20081, and 
b e (0,1) (here, b = 0.1 ( Liang and Zhan^ [2009 1). The negative log-posterior distribution function, or else 
energy function, of the Bayesian network is U 5 {Q) := — log(Pr(^|I?)). 


Existing methods for learning Bayesian networks include conditional independence tests (Wermuth and 


Lauritzen 19821, optimisation (Heckerman et al. 19951, and MCMC simulation (Madigan and Raftery[ 1994 


Liang and Zhang, 20091 approaches. Often interest lies in finding the maximum a posteriori (MAP) putative 


network that can be performed by minimising the negative log-posterior distribution density Determ¬ 

inistic optimisation procedures often stop at local optima structures. Standard MCMC based approaches. 


although seemingly more attractive (Liang and Zhang 20091, are still prone to get trapped in local energy 
minima indefinitely. This is because the energy landscape of the Bayesian network can be quite rugged, with 
a multitude of local energy minima being separated by high energy barriers, especially when the network 
size is large. Here, we examine the performance of PIS A A against this challenging optimisation problem. 


We consider the Single Proton Emission Computed Tomography (SPECT) data set (Cios et al. 1997 


Kurgan et al., 20011, available at UC Irvine Machine Learning Repository that describes diagnosing of 


cardiac SPECT images. It includes 267 SPECT image sets (patients) processed to obtain 22 binary feature 
patterns that summarise the original SPECT images. Each patient is classified into two categories: normal, 
and abnormal. 

We examine the performance of PISAA as a function of the iterations and the population size, and 
compare it with those of PSAA, and VESA. PISAA uses algorithmic settings: (i) n = 2 • 10® iterations, 
(ii) uniformly spaced grid {uj} with m = 2001, ui = 2000, U 2001 = 3999, (iii) desirable probability with 
parameter A = 0.05, (iv) temperature ladder {rt} with = 50, = 1, r,,, = 10“^, (iv) gain factor { 74 } 

with = 10®, /3 = 0.55. The MCMC kernel is designed to be a random scan of mutation operations 


only (temporal order, skeletal and double skeletal suggested by (Liang and Zhang, 2009 Wallace and Korb 


19991) with equal operation rates. PSAA and VESA share common settings with PISAA. Each simulation 


runs for 48 times to eliminate output variations caused by nuisance factors. 

Figure [4.8a| presents the average progression curves of the best values discovered by PISAA at different 
population sizes. We observe that increasing the population size accelerates the convergence of the algorithm 
towards smaller best values. Figure [4.8b| shows the best function values discovered by PISAA, PSAA, and 
VESA using 30 chains each. We observe that PISAA tends to discover smaller best values quicker than 
PSAA and VESA. In Figure [4.8c| we present the best values discovered by the algorithms under comparison 
after 2-10® iterations as functions of the population size. We observe that PISAA has discovered smaller best 
values than PSAA and VESA. A reader, non-familiar to the Bayesian network modelling, might argue that 
the observed improvement in performance of PISAA due the population size increase is not that eye-catching 


in Figure 4.8c because of the decisively small slope of the curve. In Bayesian networks (Liang and Zhang 


^http://archive. ics .uci. edu/ml, unless changed 
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(a) Average progression curves of 
the best function values discovered 
by PISAA with different population 
sizes. 


(b) Average progression curves of the 
best function values discovered by 
PISAA, PSAA, and VFSA with pop¬ 
ulation size 30. 


(c) Average best function values 
discovered by PISAA, PSAA, and 
VFSA against the population size. 


Figure 4.8: (Section 


4.51 Performance and comparison plots of PISAA, PSAA, and VFSA. 


20091, even slightly different negative log-posterior probabilities can correspond to very different network 


structures leading to different statistical inferences. 

The MAP putative network computed by running PISAA with population size 30 and 2 • 10® iterations 


is shown in Figure 4.9 


5 Summary and conclusions 

We developed the parallel and interacting stochastic approximation annealing (PISAA) algorithm, a stochastic 
simulation procedure for global optimisation, that builds upon the ideas of the stochastic approximation 
annealing and population Monte Carlo samplers. PISAA inherits from SAA a remarkable self-adjusting 
mechanism that operates based on past samples and facilitates the system to escape from local traps. Fur¬ 
thermore, the self-adjusting mechanism of PISAA is more accurate and stable because it uses information 
from all the population of chains. Yet, the sampling mechanism of PISAA is more effective because it allows 
the use of advanced MCMC transitions such as the crossover operations. Furthermore, it breaks sampling 
into multiple parallel procedures able to search for minima at different sampling space regions simultan¬ 
eously. This allows PISAA to demonstrate a remarkable performance, and be able to address challenging 
optimisation problems with high dimensional and rugged cost functions that it would be quite difficult for 
SAA to tackle acceptably. The computational overhead due to the generation of multiple chains can be 
reduced dramatically if parallel computing environment is available. 

We examined empirically the performance of PISAA against several challenging optimisation problems. 
We observed that PISAA significantly outperforms SAA in terms of convergence to the global minimum 
as it effectively mitigates the problematic behaviour of SAA. Our results suggested that, as the population 
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4.51 MAP estimate 


Figure 4.9: (Section 
Gmap of the putative network, as compnted 
by running PISAA with population size 25 for 
2 • 10® iterations. {Us{Gmap) = 3026.935103) 


The data set considers 267 cardiac Single 
Proton Emission Computed Tomography 
(SPECT) images and particularly variables 
that corespond to features : 


The overal diagnosis, coded as ‘Overal 
diagnosis’, that is a class attribute with values 
‘normal’ and ‘abnormal’. 


The j-th partial diagnosis, coded as ‘Fj’, 
that takes values ‘normal’ and ‘abnormal’, 
where j = 1,..., 22. 


size increases, the performance of PISAA improves significantly in terms of discovering the global minimum 
and adjusting the target density. Precisely, when the population size increases, PISAA discovers the global 
minimum quicker, and the adjustment of the target density is more stable. More importantly, we observed 
that instead of running several SAA procedures completely independently, it is preferable to run one PISAA 
procedure with the same number of chains (or equiv. population size). In our examples, PISAA significantly 
outperformed other competitors, such as SA and ASAMC, and their population analogues, such as VFSA 
and AESAMC. In fact, it was observed that as the population size increases, the performance of PISAA 
improves significantly quicker than that of VFSA and AESAMC. 


Under the framework of PISAA, we showed that theoretical results of Song et al. (2014| for pop-SAMC 


regarding the asymptotic efficiency of the estimates of the unknown bias weights hold for PISAA as well. 


and presented theoretical results of Liang et al. (20141 for SAA regarding the convergence of the algorithm 


that hold for PISAA as well. The empirical results confirmed that PISAA produces correct estimates for the 
unknown bias weights w* as Tt ^ and that the efficiency of these estimates significantly improves as the 
population size increases. Moreover, the theoretical limiting ratio between the rates of convergence of their 
estimates generated by PISAA and SAA was also confirmed by our empirical results. 

Another important use of PISAA could be that of sampling from multi-modal distributions and then 
performing inference via importance sampling methods. PISAA can be extended to use an adaptive binning 


strategy for automatically determining the partition of the sampling space similar to (Bornn et al., 20131, 


or a smoothing method to estimate the frequency of visiting each subregion similar to (Liang, 20091. Of 
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particular interest would be to extend PISAA so that it can allow different partition schemes and desired 
probabilities for each population individual while ensuring the stability of the self-adjusted mechanism. 


Supplementary material 

Supplementary material for the article is available online. 

Appendix The appendix contains: 

• Theoretical analysis of PISAA. 

• The pseudo-algorithms of the MCMC kernel mutation and MCMC kernel crossover opera¬ 
tions considered in the examples (Section 
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Appendix 


A Theoretical analysis of PISAA 


The PISAA algorithm falls into the general class of the stochastic approximation MCMC (SAMCMC) al¬ 
gorithms. In order to study the convergence of PISAA, we adopt the technique developed by|Chen and Zhu| 


(19861. Traditionally, the convergence of such algorithms is studied by reformulating the equation in Step 2 of 


Algorithm[^as 9' = 9t-i + where 

is called the mean field function, and — h^^\6t-i) is called the observational noise. 

Similar to SAA, PISAA solves the integral equation h!'^^ {9) = 0 in the context of stochastic approximation, 
by solving sequentially the system of equations {h!'^\9) = 0; t = 1,2,...} defined along the temperature 
sequence {tj}. The idea is that if {r^} does not decrease too fast, the solution of = 0 can be used as 

an initial guess for = 0. Thus, in the limit, the convergence 9t —> 0=1! can hold under appropriate 

conditions, where 9:^: is the solution of the equation of interest. For mathematical simplicity, in what follows, 
we treat the temperature r e T as a continuous variable instead of a sequence, and assume that T' is compact, 
T = [t*, Ti]. For parameter 0 e 0, we assume 0 = K"* where m is the number of subregions. 

For PISAA, we have 

h[^\9)= r 

[- 2 Hr{9, fl 

'■ i=l 3 = 1 

= - V r i?,(0,xW)/,,,(:r«)dxW; 

'^^=l^x 

^ • 1 
1 = 1 

= hr{9), 


(A.l) 


I 


where is the mean field function of SAA (Liang et ah, 20141. Likewise, it is easy to show that 


Var,(,.) 

,Tt 


(Ct ) = (?t )■ Thus, for K e N — {0}, PISAA solves the same set of integration 

equations as the single-chain SAA, while reducing the variation in the mean field approximation. Note that, 
if K = 1, PISAA reduces to the single-chain SAA. 


A.l Conditions for PISAA 

The convergence of PISAA is studied under conditions (Ai - A 4 ) assumed for the mean field function. 


observation noise, gain factor, and temperature sequence. We recall from (A.l I that hi^\9) = hT{9) for 
K > 1. To easy the notation we suppress indexes and when no confusion is caused. 


(Ai) (Lyapunov condition) 
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(i) The function hr{0) is bounded and continuously differentiable with respect to both 9 and t, and 

there exists a non-negative, upper bounded, and continuously differentiable function VriO) such 
that for any A > <5 > 0, 

sup '\70VT{9)hr{9) < 0, (A.2) 

<5sgd((e,r),C)sgA 

where £ = {{9,t) : hr{9) = 0, 0 e 0, r e T} is the zero set of h^{6), and d{z,S) = infj^{|| 2 ; — y\\ : 
y G S}. Further, the set v{£) = {vr{0) ■ {9,t) g £} is nowhere dense. 

(ii) Both WgVr{9) and WrVr{9) are bounded over 0 x T. In addition, for any compact set K, (z Q, 
there exists a constant 0 < c < oo such that 

sup ||V 0 Ur( 6 ') - WeVr{9')\\ ^ c\\6 - 0'\\, 

(9,e')€KxK,T€T 

sup ||VeUr( 6 ») - V 0 Ur'( 6 »)|| ^ c|r - r'l, (A.3) 

eeK,(T,T')erxr 

sup \\hr{9) — hT>{0)\\ ^ c\t — t'\. 
eeK,{T,T')eTxT 

(Doeblin condition) 

For any given 0 e Q and t e T', the Markov transition kernel Pg^r is irreducible and aperiodic. In 
addition, there exist an integer /, 0 < <5 < 1 , and a probability measure u such that for any compact 
subset /C c 0 , 

inf pi Jx, A) ^ 6iy(A), MxeX, MAeBx, 

OelCxeT 

where Bx denotes the Borel set of X; that is, the whole support T is a small set for each kernel Pg^r, 
0 e K, and t g T. 

(Stability Condition on hr{0)) 

For any value r G T, the mean field function hT{0) is measurable and locally bounded on 0. There 
exist a stable matrix Ft (i.e., all eigenvalues of Ft are with negative real parts), p > 0, and a constant 
c such that, for any ( 6 *,j,t) g £ (defined in Ai), 

\\hT{0) - Ft{0 - 0*)II ^ c||0 - p, V 0 G {0 : IP - 0*11 ^ p}- 


(Conditions on {yt} and {rt}) 


(i) The sequence {yt}, which is defined to be y(f) as a function of t and is exchangeable with y(t) in 
this paper, is positive, non-increasing and satisfies the following conditions: 


£74 = 


00, 


7t+i - It 
It 


00 (l + i,')/2 

= O(ytPi), 2 ^ < 00, 

t=l 




(A.4) 


for some r G [1,2) and d G (0,1). 
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(ii) The sequence {rt} is positive and non-increasing and satisfies the following conditions: 


00 

lim Tt = rH=, Tt - Tt+i = V 7 t|Tt - Tt-il'- < oo, (A.5) 

»-G0 

t = l 

for some l'' g ( 0 , 1 ), and 

00 

Xi Ttkt - T-*| < 00 , (A. 6 ) 

t=i 

(iii) The function ((t) = is differentiable such that its derivative varies regularly with exponent 

/3 — 1 ^ —1 (i.e., for any z > 0 , as t ^ oo), and either of the following two 

cases holds: 

(111.1) 7 (t) varies regularly with exponent (—/3), | < /? < 1 ; 

(111. 2 ) For t > 1, 7 (t) = to/t with —2Xp^to > max{l,/3} for any t e T, where Xp denotes the 
largest real part of the eigenvalue of the matrix Fr (defined in condition A 3 ) with Xp^ < 0 . 


The Lyapunov condition (Ai) is related to the mean field function hr- The mean field function of PISAA 


is equal to that of SAA as shown in (ATI, and hence condition (Ai) can be verified as a consequence of 
Liang et al. (2014, p. 850). Briefiy given (A.ll, it is h!'r\o) = ~ ^ l,...,m) where si^\9) = 

and Sr^O) = ^^3 Sr ^ {9), which is bounded and continuously differentiable with 
respect to both 9 e Q and t e T- We defined the Lyapunov function Vr{9) = | (g) ~ which is 

non-negative, upper bounded, and continuously differentiable. The gradient X/eVr{9) is bounded over 0 x T, 


following Liang et al. (2007, p. 318); while VrVr(9) is bounded over 0 x T, provided that (7(x) has a finite 
mean with respect to frix). Yet, the second partial derivatives of Vr{d) with respect to 9 and r are bounded 


provided that U{x) has a finite variance with respect to fg^rix). Then, (A.2 1 is verified as in (Liang et al. 


20071, on the condition that the partition of the sampling space includes at least two non-empty subregions. 
The observation noise condition (A 2 ) is equivalent to assuming that the resulting Markov chain has a 


unique stationary and is uniformly ergodic (Nummelin, 2004 1 . It is not too restrictive for a PISAA whose 
function Hi^\9t-i, is bounded, and thus the mean-field function and observation noise are bounded. 

Condition (A 2 ) is satisfied if X is compact, U{x) is bounded, and the proposal distribution used to simulate 
from Pg r satisfies the local positive condition (Q): “There exists 6q > 0 and q > 0 such that, for every 
X e X, |a: — y| ^ ^ q{x,y) > 9 ”; following (Theorem 2.2 of Roberts and Tweedie, 19961. Condition (A 2 ) 

may also be verified in cases that X is not compact, e.g. (Rosenthal 1995| |. Multistep Metropolis-Hastings 
moves, such as those mentioned in Section]^ can be shown to satisfy (A 2 ); see (Lemma 7 of Rosenthal 


19951 and ( Liang[ 20091. If (A 2 ) holds for the single-chain kernel Pg^r, it must hold for the multiple-chain 


one as well; see (Supplementary material of Song et al. 20141. 

Condition (A 3 ) constrains the behaviour of the mean field function around the solution points. 
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Pelletier 


We remark that (A 4 )-(iii) can be applied to the usual gains 74 = , 1/2 < /3 ^ 1. Following 

(^) + l + 


(1998), we deduce that 


In terms of 74 , (A.7l can be rewritten as 

/ \ 1/2 

It ^ 


= 1 + C7t + o(7t). 


(A.7) 


(A. 8 ) 


V7t+i, 

where C = 0 for the case (iii.l) and ^ for /3 = 1 for the case (iii.2). Clearly, the matrix FV + C,I is 
still stable. Furthermore, condition (A 4 )-(ii) implies that {rt} cannot decrease too fast, and should be set 
according to the gain factor sequence { 74 }. A choice of r 4 = ;^ + t*, with ti > 0, satisfies (A 4 )-(ii). 

A.2 Main theorems hold in PISAA framework 

Under the conditions (Ai - A 4 ), the following theorems for the convergence of PISAA hold. Since Theorems 
|A.1| |A.2| and A.4 are applicable to both the PISAA and single-chain SAA algorithms, we let Xt denote the 
sample(s) drawn at iteration t and let X denote the sample space of Xt- For the PISAA algorithm, we have 
X = A” and Xt = For the single-chain SAA algorithm, we have X = A and A 4 = xt- For any 

measurable function /: X ^ Pgf{X) = ^^Pg{X,y)f{y)dy. 


Theorem A.l. (Restatement of Theorems 3.1 and 3.2 of Liang et al. (2014)) Assume that T is compact 
and the conditions {Af), (A 2 ), {A 4 )-(i) and {A 4 )-(ii) hold. If Oq used in the PISAA algorithm is such that 
sup^g. 7 -Ut(0o) < iiif||e||=co,TeT for some cq > 0 and ||0o|| < cq, then the number of truncations in PISAA 
is almost surely finite; that is, {Ot} remains in a compact subset of Q almost surely. In addition, as t —>■ co, 

t4{f^t .;^ 0 , a.s., 

where = {0 e 0 : hr,^, (O) = 0} and d{z, S) = infy{|| 2 ; — y\\ : y e S}. 

Theorem A. 2. (Restatement of Theorem 3.3 of \Liang et al. 1(2014}) ) Assume the conditions of Theorem 
m hold. Let xi,.. .,Xn denote a set of samples simulated by PISAA in n iterations. Let g: H ^ M. be a 
measurable function such that it is bounded and integrable with respect to fg^rix). Then 


n p 

y,9{xt)^ g{x)fg,^ ^r„{x)da 

t=i 


a.s. 


Therefore, given conditions (Ai - A 4 ) and following Liang et al. (2014 Corollary 3.1), PISAA can achieve 
the following convergence result with any individual: For any e > 0, as t —> cx), and r,,, —> 0 


P(C/(A 4 )^<-)-e|J(A 4 )=j)-l, a.s.. 


where J{x) = j ii x e Ej, and u* = U{x), for j = 1, ...,m. Namely, given a square-root cooling 

schedule, as the number of iterations t becomes large, PISAA is able to locate the minima of each subregion 
in a single run if r,,, is small. 
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Lemma A.3 concerns the decomposition of the noise in the PISAA framework. The proof of Lemma 

|A.3| is presented separably in Appendix |A.3| The importance of this lemma is that by using Lemma [A.3| 
the Theorems |A.4| and |A.5| can be proved to hold in PISAA framework as consequences of the results from 
(Song et al. 20141. Theorem A.4 concerns the asymptotic normality of Ot- With Lemma A.3 the proof 


of Theorem A.4 can be referred to the proof of (Theorem 3, Song et al. 20141 except for some notational 


changes, replacing h{9t) by Theorem A.5 concerns the asymptotic relative efficiency of the PISAA 


estimator of 9t versus that of SAA. The proof of Theorem A.5 is the same as that of (Theorem 4, Song et al. 


20141 using Theorem A.4 and Lemma A.3 


Lemma A.3. (Noise decomposition) Assume the conditions of Theorem A.l hold. Then there exist 
valued random processes {et}, and defined on a probability space (Tl,iF,V) such that: 

(i) ft+i = et+i + vt+i + ^t+i, where ^t+i = Hr^+^{9t,Xt+i) - hrt+^{9t) is the observation noise. 

(ii) For any constant p > 0 (defined in condition A 2 ), 


supT;(||et+i||“|J't)l{[|e^_e^ll^p| < 00, 
t^o 

where Tt is a family of a-algebras satisfying a{9o,XQ; 9i,Xi ]...; Xj} = Xt c Xt+i for allt^O and a > 2 
is a constant. 

(Hi) Almost surely on A(0,j,) = {9t as n —>■ 00 , 

1 " 

- 2 ^ r, a.s., (A.9) 

where T is a positive definite matrix. 

(iv) E{\\vtf /'^t)l{\\et-s^\\i:p} ^ 0 , as t^ CO. 

(v) E\\jt‘;t\\ ^ 0 , as t^ CO. 


Theorem A.4. (Consequence of (Theorem 2, \Song et al. 2914) and Lemma A.S) Assume that T is com¬ 
pact and the conditions (Ai), (A2), (A3) and (A4) hold. If 9 q used in the PISAA algorithm is such that 
sup,-g 7 - Ut(0o) < iiif||6i||=co,T€r for some cq > 0 and ||0o|| < cq, then, Conditioned on h.{9,^) = {9t 0*}, 

9t — 6** 




•Af(0,S), 


with denoting the weak convergence, Af the Caussian distribution and 


S = 


^00 

Jo 


(A.IO) 


(A.ll) 


where is defined in (A 2 ), ( is defined in (A.8), and T is defined in Lemma A.S 


Theorem A.S. (Consequence of (Theorem S, Song et al. 291))) Suppose that both the population PISAA 


(with pop. size k) and single-chain SAA algorithms satisfy the conditions given in Theorem A.4 Let 9^ and 
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91 denote the estimates produced at iteration t by the multiple-chain PISAA and single-chain SAA algorithms, 
respectively. Given the same gain factor sequence { 74 }, then {9f — ~ 0*)/ ■s/K'yKt have the 

same asymptotic distribution with the convergence rate ratio 


^ 


(A. 12) 


where k denotes the population size, and /3 is defined in (A 4 ). [Note: 1/2 < /3 < 1 for the case A/^-(iii.l) 
and /3 = 1 for the case A4-(iii.2).j 


A.3 Proof of theoretical results 

In order to prove Lemma[A.3| we introduce Lemma [A. 6 | which is a restatement of Lemma 1.1 of|Liang et al.| 


2014 

online supplement) and Proposition 6.1 of 

Andrieu et al. 

(2005 


Lemma A.6 . (Restatement of Lemma 1.1 of Liang et al. (2014. online supplement) and Proposition 6.1 of 


Andrieu et al. 1(2005 )) Assume that T is compact and the condition {A 2 ) holds. Then the following results 
hold for the PLSAA algorithm: 

(Bi) For any 0 G 0 and t £ T, the Markov kernel Pg^r has a single stationary distribution fg^r- In 
addition, H : Q x X Q is measurable for all 9 B Q and t bT, \ Hr{9, x)\fe ^r{x)<lx < 00. 

{B2) For any 9 B Q and t bT, the Poisson equation ug^T.{X) — Pg ^-ug^riH) = Ht.(9,X) — hr{9) has a 
solution ug^r{X), where Pg^rUg^riX) = ug^r{y)P 0 ,T{X,y)dy. For any constant rj B (0,1) and any compact 
subset /C c 0, the following results hold: 

(i) sup {\\ug^r{-)\\ + \\Pe,rUg^r{-)\\) < (Xi, 

eeK.reT 

(n) sup ||6» - 9 '\\~^ {||Me,r(-) - Ug'^r{-)\\ + \\Pe,rUg^r{-) - Pe',rUgi ^r{-)\\} < CO. 

(e, 9 ')eKxK,Ter 


(Hi) sup \t — T' 

eeK,{T,T')eTxT 




Po.rng^ri'} Po,T'ng^r'(‘(\\ ^ CO. 


{B3) For any rj B (0,1), 


sup ||0 — 0 '|| —/it-( 6 *')|| < 00 . 

(e,e')€0xe 
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Proof of Lemma IA.3I 


Proof, (i) Define 


6t+l — + i (s^t + l) P9t,Tt + iU0t,Tt + iixt) 


7i + l 

"I [^6(t4.i,rt+2^St + i,rt+2 (^^t + l) ~ P9t+l,Tt + l1^0t+l,Tt + l{^t + l)^^ 

7i + l 

^t+l ~ lt + lP9t,Tt + l'U9t,Tt + l{^t)t 
1 


(A.13) 


<^4+1 — 


7t+i 


(■it+i ~ ^t+ 2 ), 


where it(-) is the solution of the Poisson equation. It is easy to verify that ft+i = et+i + J^t+i + “tt+i holds, 
(ii) By ( |A.13| ), we have 


E{st+i\Pt) — P('aet,rt+i (Ai_|_i)|Pt) P9t,Tt+i'^9t,Tt+ii.^t) — 0, 

Hence, {et} forms a martingale difference sequence. Following from Lemma [A. 6} (P 2 ) ? we have 


supP(||et+i||“|Pt)l{||et-e,,||=sp} < QO. 

t;sO 


(A. 14) 


(A. 15) 


This concludes part (ii). 


(iii) By (A.131, we have 


P(e4+ief+i|Pt) = E [ug^{Xt+i)u9,{Xt+if\Pt] - P9,ue,{Xt)P9^ue,{Xtf 


= 1{X,). 


(A. 16) 


It follows from Lemma [A . 6 } (P 2 ) that l{Xk) is bounded, and then it follows from Theorem A.2 that 

1 n p 

-J]KXt)^jJ{x)f 9 ^,T^{x)dx = T, a.s. (A.17) 


for some positive definite matrix T. This concludes part (iii). 
(iv) By condition (A 3 )-(i), we have 


74+2 - 74+1 

74+1 


~ ^( 7 ^+ 2 ); 


for some value t e [1,2). By (A.13) and (P 2 ) of Lemma A .6 there exist constants ci, c( and 77 e (0.5,1) 


such that the following inequality holds, 

||t'4+i|| < ci||0t+i — 9t\\ + 0{'jf_f_2) + c)|rt+i — rt+2|'' = Ci\\'yt+iHrt^^{9t, Xt+i)\\ + 0(74^+2) + 
which implies, by the boundedness of Hr {9, •), that there exists a constant C 2 such that 


1^4 + 1 II < C 274 + 1 + o( 7 j\i). 
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Therefore, 


This concludes part (iv). 

(v) A straightforward calculation shows that 

It + l^t+l = 'it+l - 'ft + 2 = 7t+l'Pet.Tt + lU6(t,Tt+i(At) - 74 + 2^6»t+i,Tt + 2 ^04 + 1.1-4 + 2 (-’^t + l)) 

By (B 2 ), E [||P 6 it,rt+iii 04 ,r 4 +i(At)II] is uniformly bounded with respect to t. Therefore, (v) holds. □ 


Proof of Theorem IA.4I 


Proof. With Lemma A.3 the proof of this theorem can be referred to the proof of (Theorem 2, Song et al. 


20141 except for some notational changes, replacing h{9t) by hr^^^iOt). 


□ 


Proof of Theorem IA.5I 


Proof. The proof of this theorem is the same as that of (Theorem 3, Song et al. 2014 1 , with using Theorem 
IA.4I and Lemma IA.3I □ 
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B MCMC kernel crossover operations used in Section 


Let K denote the population size of the population and d denote the number of dimensions of each 

individual for i = 1 ,k. 


The pseudo-codes of the MCMC kernel crossover operations, used in Sections 4.1 - 4.4 are presented 
below. More details can be found in ( Liang[[2011 Liang and Wong[|2000[ 20011. 

• fc-point crossover operation (continuous or discrete target distributions): 


1 . draw i ~ and j\i ~ 

2 . draw crossover points vector v ~ {1,..., d — 1}, without replacement and sort them 

3. design x'^’''^ and x'^^^ from and x'^^^ by swapping their elements between each odd and the 
next even crossover points 


4. accept := {x^^''‘ withprob. okc 


min(l, 




• Snooker crossover operation (continuous target distributions): 


1. draw i ~ ■n7®'^(i; and j\i ~ W 2 ^{j\i\x^^''^'^) 

2. compute = x^'^'> + o'sc^SC ’ where rsc ~ N(0,1) 

II II 2 

3. accept := (a;^^-®“^\ a;(®+^-”)) with prob. asc = min(l, (^(i)|g) ) 

• Linear crossover operation (continuous target distributions): 


1. draw i ~ vj\^{i]X^^''^^) and j\i ~ tJ72®(j|i; 

2. compute x'^®^ = x^®^ + ri,cx^^\ where rue ~ U(—1,1) 

3. accept := (x(^'®“^\ x'(®\ x^®"'"^'®®^) with prob. olc = min(l, ) 


For the crossover operations, 



x(i^ 

:k)) ^ 

.^KC 

W2 

01* *; 

x(i^ 

:k)) ^ 


x(i^ 

:k)) ^ 

.^sc 

W2 

01*; 

x(i^ 

:k)) ^ 

LC/ • 

U7i [i; 


:k)) ^ 

.^LC 

072 

01*; 

x^^ 

:k)) ^ 


ve considered probabilities: 

exp(-C(x(®))/TKc) 

Iiv£ exp(-[/(xW)/rKc) ’ 
exp(-t/(x(®))/TKc) 
i;v£^iexp(-[/(xW)/rKc)’ 
1 

5 

K. 

exp(-t/(x(®))/Tsc) 
i;v^^iexp(-[/(xW)/rsc) ’ 
1 

5 

K 

exp(-[/(x^^^)/rLc) 

i;v£^^exp(-;7(x(^))/rLc)’ 


is 11,..., , 

j G IjiH- 1 ,..., , 

z G {1,, 

j G l,i + 

Z G 1 1 , ..., , 

j G IjZH- 1,..., , 
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with quantities tkC) tsCj and tlc equal to 0.1, in Section]^ 

The pseudo-codes of the MCMC kernel mutation operations, used in Sections |4.1|- [T3] are given below. 


More details can be found in (Smith 1984 Chen and Schmeiser 1993 , Liang 2011 Metropolis et al., 19531. 

• Metropolis mutation operation: 

For i = 1,..., k: 

1 . compute = x^^'> + (Tmrw^mrw where tmrw ~ N(0 , Id) 

2 . accept := with prob. omrw = niin(l, 

• Hit-and-run mutation operation: 

For i = 1,..., k: 

1 . compute + cr^j^ruReHR, where thr ~ N(0 ,1) and chr is drawn randomly from a unit 

d-dimensional space 

2 . accept := with prob. ohr = min(l, 

• fc-point mutation operation: 

For i = 1 ,..., k: 

1 . compute x'^'^'^ = a:^*^ + crKM’’KMeKM) where tkm ~ N(0 ,1) and ckm is a fc < d aces 0-1 d- 
dimensional vector randomly drawn 

2 . accept := with prob. okm = min(l, 


The Gibbs update (updating one pixel at a time) in the Spatial imaging example in Section 4.4 is given 
below. 

• Gibbs mutation operation in Section [4.4| 

For i = 1,..., k: 


1 . draw j randomly in { 1 ,..., d} 




(i) 


2. draw ~ Bernulli(t:t7G/(j; where WGi{j-,x^^">) = (1 -f (p - q) , (p 






The pseudo-codes of the MCMC kernel mutation operations, used for the Bayesian network example in 


Section 4.5 are given below. More details can be found in (Liang and Zhang 2009 Wallace and Korb 19991. 

• Temporal order operation: 

For i = 1 ,..., k: 

1 . compute by swapping the order of two randomly selected neighbouring nodes; if there is an 
edge between them, reverse its direction. 
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2 . accept g'W, with prob. cto = min(l, 


• Skeletal change: 
For 1 = 1,..., k: 


1 . compute by adding or deleting an edge between a pair of randomly selected nodes. 

2 . accept g'W, gb+i:«)) with prob. asc = min(l, 

• Double skeletal change: 

For i = 1, 


1 . compute by randomly choosing two different pairs of nodes, and adding or deleting edges 
between each pair of the nodes. 

2 . accept = (^(i—^b+i-)) with prob. ods = min(l, 

Remark B.l. The scale parameters of the proposals of the operations were tuned during pilot runs using the 
adaptation scheme: 

log(cr^j^^) <— log(cr^j^^) + [omrw —0.234]; this ensures that the associated expected acceptance probabilities 
will be around 0.234. In our applications, the performance of this adaptation scheme was acceptable, however 


more sophisticated schemes can be used. For more adaptive Metropolis-Hastings schemes see (Andrieu and 


Thoms 20081. 
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